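Existing state-of-the-art (SOTA) video salient object detection (VSOD) models have widely followed a short-term methodology, which dynamically determines the balance between spatial and temporal saliency fusion by considering only a limited number of current consecutive frames. However, the short-term methodology has one critical limitation: it conflicts with the real mechanism of our visual system, which is a typical long-term methodology. As a result, failure cases keep appearing in the results of current SOTA models, and the short-term methodology has become the major technical bottleneck. To solve this problem, this paper proposes a novel VSOD approach that performs VSOD in a complete long-term way. Our approach converts VSOD, a sequential task, into a data mining problem: it decomposes the input video sequence into object proposals and then mines the salient object proposals as easily as possible. Since all object proposals are available simultaneously, the proposed approach is a complete long-term approach, which can alleviate some difficulties rooted in conventional short-term approaches. In addition, we design an online updating scheme that can grasp the most representative and trustworthy pattern profile of the salient objects and outputs framewise saliency maps with rich details that are smooth both spatially and temporally. The proposed approach outperforms almost all SOTA models on five widely used benchmark datasets.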
translated by Google Translate
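With the increasing popularity and accessibility of high dynamic range (HDR) photography, tone mapping operators (TMOs) for dynamic range compression and medium presentation are in practical demand. In this paper, we develop a two-stage neural network-based image TMO that is biologically inspired, computationally efficient, and perceptually optimized. In the first stage, motivated by the physiology of the early stages of the human visual system (HVS), we first decompose an HDR image into a normalized Laplacian pyramid. We then use two lightweight deep neural networks (DNNs) that take this normalized representation as input and estimate the Laplacian pyramid of the corresponding LDR image. We optimize the tone mapping network by minimizing the normalized Laplacian pyramid distance (NLPD), a perceptual metric calibrated against human judgments of tone-mapped image quality. In the second stage, we generate a pseudo-multi-exposure image stack with different color saturation and detail visibility by feeding in the "calibrated" HDR image. We then train another lightweight DNN to fuse the LDR image stack into a desired LDR image by maximizing a variant of MEF-SSIM, another perceptually calibrated metric for image fusion. By doing so, the proposed TMO is fully automatic in tone mapping uncalibrated HDR images. On an independent set of HDR images, we find that our method produces images with better visual quality and is among the fastest local TMOs.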
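Virtual reality (VR) videos (typically in the form of 360$^\circ$ videos) have gained increasing attention due to the rapid development of VR technologies and the remarkable popularization of consumer-grade 360$^\circ$ cameras and displays. It is therefore important to understand how people perceive user-generated VR videos, which may suffer from commingled authentic distortions that are often localized in space and time. In this paper, we establish one of the largest 360$^\circ$ video databases, containing 502 user-generated videos with rich content and distortion diversity. We captured the viewing behaviors (i.e., scanpaths) of 139 users and collected their opinion scores under four different viewing conditions (two starting points $\times$ two exploration times). We provide a thorough statistical analysis of the recorded data, which yields several interesting observations, such as the significant impact of viewing conditions on viewing behaviors and perceived quality. In addition, we explore other uses of our data and analysis, including the evaluation of computational models for quality assessment and saliency detection of 360$^\circ$ videos. We have made the dataset and code available at https://github.com/yao-yiru/vr-video-database.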
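Existing state-of-the-art point descriptors rely on structure information only, omitting texture information. However, texture information is crucial for humans to distinguish parts of a scene. Moreover, learning-based point descriptors are black boxes: it is unclear how the original points contribute to the final descriptor. In this paper, we propose a new multimodal fusion method that generates a point cloud registration descriptor by considering both structure and texture information. Specifically, a novel attention-fusion module is designed to extract the weighted texture information for descriptor extraction. In addition, we propose an interpretable module to explain which original points contribute to the final descriptor. We use the descriptor element as the loss to backpropagate to the target layer and treat the gradient as the significance of a point to the final descriptor. This paper moves one step further toward explaining deep learning in the registration task. Comprehensive experiments on 3DMatch, 3DLoMatch, and KITTI show that the multimodal fusion descriptor achieves state-of-the-art accuracy and improves the descriptor's distinctiveness. We also demonstrate the use of our interpretable module in explaining registration descriptor extraction.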
Benefiting from its ability to exploit intrinsic supervision information, contrastive learning has recently achieved promising performance in deep graph clustering. However, we observe that two drawbacks of the positive and negative sample construction mechanisms prevent existing algorithms from improving further. 1) The quality of positive samples depends heavily on carefully designed data augmentations, while inappropriate data augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are unreliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) that mines the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function that pulls samples from the same cluster together while pushing away those from other clusters, by maximizing the cross-view cosine similarity between positive samples and minimizing it between negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
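The objective described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `cluster_guided_loss`, the `1 - cosine` form of the positive term, the mean-of-similarities negative term, and the way the high-confidence mask is supplied are all assumptions made for illustration. It only shows positives drawn from the same high-confidence cluster across two views and cluster centers used as negatives.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cluster_guided_loss(z1, z2, labels, high_conf):
    # z1, z2: (N, d) embeddings of the same nodes from two views.
    # labels: (N,) cluster ids; high_conf: (N,) boolean mask of
    # high-confidence nodes. All argument names are hypothetical.
    z1 = l2_normalize(z1[high_conf])
    z2 = l2_normalize(z2[high_conf])
    y = labels[high_conf]

    # Positive term (assumed form): cross-view pairs from the same
    # high-confidence cluster should have cosine similarity close to 1.
    sim = z1 @ z2.T
    same_cluster = y[:, None] == y[None, :]
    pos_term = (1.0 - sim[same_cluster]).mean()

    # Negative term (assumed form): centers of different high-confidence
    # clusters should have low cross-view cosine similarity.
    ids = np.unique(y)
    c1 = l2_normalize(np.stack([z1[y == k].mean(axis=0) for k in ids]))
    c2 = l2_normalize(np.stack([z2[y == k].mean(axis=0) for k in ids]))
    center_sim = c1 @ c2.T
    off_diag = ~np.eye(len(ids), dtype=bool)
    neg_term = center_sim[off_diag].mean() if len(ids) > 1 else 0.0

    return pos_term + neg_term
```

With perfectly separated clusters that agree across views, both terms vanish. If the two views assign the clusters to swapped directions, both terms grow.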
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the conventional evaluation outcome, the environment returns, as the coefficient to build an importance map. We also conduct experiments to investigate three major questions concerning the equality of frames' importance, the effectiveness of the importance map, and the connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes important frames in the demonstrations.
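The mask-retrain-weight loop described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the R2RISE implementation: `evaluate_masked` is a hypothetical callback standing in for "retrain the IL model on the masked demonstrations and measure the environment return", and the normalization by how often each frame was kept is an assumed, RISE-style choice.

```python
import numpy as np

def importance_map(num_frames, evaluate_masked, num_masks=200,
                   keep_prob=0.5, seed=0):
    # evaluate_masked(mask) is a hypothetical callback: it should retrain
    # the black-box IL model using only frames where mask == 1 and return
    # the environment return of the resulting policy.
    rng = np.random.default_rng(seed)
    weighted_sum = np.zeros(num_frames)
    kept_count = np.zeros(num_frames)
    for _ in range(num_masks):
        # Draw a random binary mask over the demonstration frames.
        mask = (rng.random(num_frames) < keep_prob).astype(float)
        ret = evaluate_masked(mask)
        # Use the return as the coefficient of this mask.
        weighted_sum += ret * mask
        kept_count += mask
    # Average return observed over the masks that kept each frame.
    return weighted_sum / np.maximum(kept_count, 1.0)
```

In a toy setting where the return depends on a single frame, that frame receives the highest importance, because every mask that keeps it yields the full return.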
Research interest in sequential recommender systems, which aim to model dynamic sequence representations precisely, is growing. However, the most commonly used loss functions in state-of-the-art sequential recommendation models have essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers from a vanishing gradient problem arising from numerous negative samples and prediction biases; Binary Cross-Entropy (BCE) loss is subject to the number of negative samples, so it is likely to ignore valuable negative examples and reduce training efficiency; Cross-Entropy (CE) loss focuses only on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representations. To avoid these limitations, in this paper we propose to calculate a Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, and it enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models, GRU4Rec, SASRec, and S3-Rec, yields average improvements of 125.63%, 69.90%, and 33.24% in full-ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data rises rapidly with wall-clock time and is superior to that of other loss functions during almost the whole training process.
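The contrast between last-timestamp CE and a loss accumulated over the whole sequence can be made concrete with a small NumPy sketch. The abstract does not give the exact formula, so averaging the per-position cross-entropy is an assumption here (a plain sum would differ only by a constant factor), and the function names are illustrative.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the item dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def last_step_ce(logits, targets):
    # logits: (T, num_items) scores at each position of one sequence.
    # targets: (T,) id of the true next item at each position.
    # Conventional CE: only the final position contributes to the loss.
    logp = log_softmax(logits)
    return -logp[-1, targets[-1]]

def cumulative_ce(logits, targets):
    # Assumed CCE form: cross-entropy over all items, accumulated over
    # every position and averaged. No negative sampling is involved.
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()
```

A model that is confident only at the last position gets a near-zero `last_step_ce` but a clearly positive `cumulative_ce`, which is the extra training signal the abstract attributes to using the full sequence.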
Face Anti-spoofing (FAS) is essential for securing face recognition systems against various physical attacks. However, recent research generally focuses on short-distance applications (e.g., phone unlocking) while neglecting long-distance scenes (e.g., surveillance security checks). To promote relevant research and fill this gap in the community, we collect a large-scale Surveillance High-Fidelity Mask (SuHiFiMask) dataset captured under 40 surveillance scenes, which has 101 subjects from different age groups with 232 3D attacks (high-fidelity masks), 200 2D attacks (posters, portraits, and screens), and 2 adversarial attacks. In such scenes, low image resolution and noise interference are new challenges for surveillance FAS. Together with the SuHiFiMask dataset, we propose a Contrastive Quality-Invariance Learning (CQIL) network to alleviate the performance degradation caused by image quality from three aspects: (1) an Image Quality Variable module (IQV) is introduced to recover image information associated with discrimination by combining the super-resolution network; (2) generated sample pairs are used to simulate quality variance distributions, helping contrastive learning strategies obtain robust feature representations under quality variation; (3) a Separate Quality Network (SQN) is designed to learn discriminative features independent of image quality. Finally, extensive experiments verify the quality of the SuHiFiMask dataset and the superiority of the proposed CQIL.
Unsupervised domain adaptation (UDA) via deep learning has attracted considerable attention for tackling domain-shift problems caused by distribution discrepancies across domains. Existing UDA approaches depend heavily on access to source domain data, which is usually limited in practical scenarios due to privacy protection, data storage and transmission costs, and computational burden. To tackle this issue, many source-free unsupervised domain adaptation (SFUDA) methods have been proposed recently, which transfer knowledge from a pre-trained source model to an unlabeled target domain while the source data remain inaccessible. A comprehensive review of these works on SFUDA is of great significance. In this paper, we provide a timely and systematic literature review of existing SFUDA approaches from a technical perspective. Specifically, we categorize current SFUDA studies into two groups, i.e., white-box SFUDA and black-box SFUDA, and further divide them into finer subcategories based on the learning strategies they use. We also investigate the challenges of the methods in each subcategory, discuss the advantages and disadvantages of white-box and black-box SFUDA methods, summarize the commonly used benchmark datasets, and review popular techniques for improving the generalizability of models learned without source data. We finally discuss several promising future directions in this field.
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of offline videos and textual query sentences. However, in real scenarios, online videos are frequently accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This inspires us to generate associated captions from offline videos to help existing text-video retrieval methods. To do so, we propose to use a zero-shot video captioner with knowledge from pre-trained web-scale models (e.g., CLIP and GPT-2) to generate captions for offline videos without any training. Given the captions, one question naturally arises: what can auxiliary captions do for text-video retrieval? In this paper, we present a novel framework, Cap4Video, which makes use of captions in three ways: i) Input data: the video and captions can form new video-caption pairs as data augmentation for training. ii) Feature interaction: we perform feature interaction between video and caption to yield enhanced video representations. iii) Output score: the Query-Caption matching branch can be complementary to the original Query-Video matching branch for text-video retrieval. We conduct thorough ablation studies to demonstrate the effectiveness of our method. Without any post-processing, our Cap4Video achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%).
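The "output score" idea in this abstract, where a Query-Caption branch complements the Query-Video branch, can be sketched as a simple score fusion. The weighted-sum rule, the `caption_weight` parameter, and the function names are assumptions for illustration only. The abstract does not state how Cap4Video combines the two branches.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def fused_ranking(query_emb, video_emb, caption_emb, caption_weight=0.5):
    # query_emb: (Q, d); video_emb and caption_emb: (V, d), where row i of
    # caption_emb embeds the generated caption of video i.
    # Hypothetical fusion rule: weighted sum of the two similarity matrices.
    query_video = cosine_sim(query_emb, video_emb)
    query_caption = cosine_sim(query_emb, caption_emb)
    scores = query_video + caption_weight * query_caption
    # For each query, return video indices sorted from best to worst.
    return np.argsort(-scores, axis=1)
```

When two videos are indistinguishable from their visual embeddings alone, the caption branch breaks the tie in favor of the video whose caption matches the query.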